Name: Jaime
Surname: de Clemente Fernández-Picazo
Time: 2 hours and 30 minutes
A Portuguese bank wants to understand in more detail the direct marketing campaigns it has run over the last few months on more than 40 thousand clients. The campaigns were based on phone calls. Often more than one contact with the same client was required to find out whether the product (a bank deposit) would be subscribed ('yes') or not ('no').
The goal of the analysis is to look for patterns to better understand the profile of the clients who subscribed to the deposit, so that the bank can find similar clients in its database and increase the response rate and ROI of future direct marketing campaigns selling the same deposit. Therefore, you are asked to:
Carry out a descriptive analysis of the data with at least 6 different visualizations. (3 points) (*)
Build a dashboard with at least 4 different visualizations, including 2 interactive components and 1 callback. (5 points) (*)
Close the analysis with recommendations for improving future direct-contact campaigns, based on the results obtained from the analyses of the data. (2 points)
A dataset with the following variables is provided for this analysis:
Remember: if you need to write a function, document its input and output arguments. Explain the order you follow when choosing the visualizations and comment on the conclusions you draw along the way.
(*) IMPORTANT: In the first two sections you may choose to build a classification model and create visualizations around it. This part is not mandatory. The goal of the classification would be to predict whether the client will subscribe to a bank deposit (variable y).
import pandas as pd
import numpy as np
import sklearn.datasets
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC
from sklearn.metrics import (classification_report, mean_squared_error, mean_absolute_error,
silhouette_score, confusion_matrix, ConfusionMatrixDisplay)
from sklearn.linear_model import ElasticNet
from sklearn.cluster import KMeans
import plotly.figure_factory as ff
from flask import Flask, render_template, request, session, redirect, url_for
import plotly.express as px
import plotly.graph_objects as go
#from prophet import Prophet
import statsmodels.api as sm
import secrets
df = pd.read_csv("./bank-full.csv", sep=";")
df.head()
def get_basics(df):
    """Summarise a DataFrame.

    Input: df (pd.DataFrame) -- the frame to describe.
    Output: pd.DataFrame -- the describe() table extended with NA counts and dtypes.
    """
    nas = df.isna().sum()
    types = df.dtypes
    description = df.describe()
    description.loc['NAs'] = nas
    description.loc['dataType'] = types
    return description
get_basics(df)
# Descriptive 1: Distribution of client education levels
# Create a Plotly figure for a pie chart
values = df['education'].value_counts(normalize=True)
labels = values.index
explode = [0.1 if label == 'secondary' else 0 for label in labels]  # pull out the secondary slice
fig1 = go.Figure()
fig1.add_trace(go.Pie(
    labels=labels,
    values=values,
    textinfo='percent+label',
    hoverinfo='label+percent',
    pull=explode
))
# Update layout
fig1.update_layout(
    title='Distribution of clients by education level',
)
# Show the plot
fig1.show()
Most of the clients have only reached secondary education.
# Descriptive 2: Age distribution of clients
# Create a histogram using Plotly
histogram = go.Histogram(x=df['age'], nbinsx=20, opacity=0.7, name='Clients age')
# Create a Plotly figure
fig2 = go.Figure(data=histogram)
# Update layout
fig2.update_layout(
    title='Distribution of clients by age',
    xaxis=dict(title='Age'),
    yaxis=dict(title='Num. of clients'),
    showlegend=True
)
# Show the plot
fig2.show()
Most of our clients are relatively young: the mean age is about 40. However, they are old enough to have completed more than secondary education, so most of them seem to have already reached the educational level they were seeking.
# Descriptive 3: Balance by job
len(df['job'].unique())
df['id'] = range(0, len(df))
data = []
for job in df['job'].unique():
    df_group = df[df['job'] == job]
    trace = go.Scatter(x=df_group['id'],
                       y=df_group['balance'],
                       mode='markers',
                       name=job)
    data.append(trace)
# Layout of the plot
layout = go.Layout(
    title='Balance in account by job',
    xaxis=dict(title='Client ID'),
    yaxis=dict(title='Balance in account'),
    showlegend=True
)
fig3 = go.Figure(data=data, layout=layout)
# Show the plot
fig3.show()
There are some clients that have a high balance, mostly white-collar jobs, but most of the clients have a balance under 20k.
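The "balance under 20k" claim can also be checked numerically rather than just visually. A minimal sketch with hypothetical balances (on the real frame this would be `df['balance']`):

```python
import pandas as pd

# Hypothetical balances; on the real frame this would be df['balance']
balance = pd.Series([-200, 0, 150, 900, 1200, 3500, 21000, 45000])

# Share of clients below a 20k threshold, plus a high quantile for the tail
share_under_20k = (balance < 20_000).mean()
p95 = balance.quantile(0.95)
print(share_under_20k, p95)
```

Reporting a high quantile alongside the share makes the "a few clients with very high balance" observation quantitative.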
# Descriptive 4: Share of clients by job
# value_counts keeps labels and shares aligned
# (unique() preserves appearance order while groupby() sorts, so pairing them mislabels the bars)
job_shares = df['job'].value_counts(normalize=True)
fig4 = go.Figure(go.Bar(
    x=job_shares.index,
    y=job_shares.values,
    marker_color='mediumseagreen'
))
# Update layout
fig4.update_layout(
    title='Share of clients by job',
    xaxis=dict(title='Type of job'),
    yaxis=dict(title='Percentage of clients'),
    bargap=0.2,
)
fig4.show()
The biggest group are technicians and, excluding those whose job we do not know, the next group is unemployed.
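One subtlety when pairing `df['job'].unique()` with grouped counts is that `unique()` preserves order of appearance while `groupby` sorts its index, so labels and values can end up misaligned; `value_counts` keeps each label paired with its share. A minimal sketch on a toy frame (hypothetical data, not the bank dataset):

```python
import pandas as pd

# Toy frame where appearance order differs from groupby's sorted order
toy = pd.DataFrame({"job": ["technician", "admin.", "technician", "blue-collar"]})

labels_by_appearance = toy["job"].unique()            # ['technician', 'admin.', 'blue-collar']
shares_sorted = toy.groupby("job").size() / len(toy)  # index sorted alphabetically

# value_counts keeps each label paired with its share in a single Series
shares = toy["job"].value_counts(normalize=True)
print(shares)
```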
# Descriptive 5: Predicting whether the client subscribes to the deposit
# Train-test split
# Encode every non-numeric column as integer codes; note that assigning
# pd.get_dummies(df[col]) back to a single column would keep only the first dummy
for col in df.columns:
    if df[col].dtype != 'int64':
        df[col], _ = pd.factorize(df[col])
X = df.drop(['y', 'id'], axis=1)
y = df["y"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=123)
print(len(X_train), len(X_test))
X
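A pitfall worth flagging: assigning `pd.get_dummies(df[col])` back into a single column keeps only the first dummy, silently discarding the other categories. If one-hot columns are wanted instead of integer codes, `pd.get_dummies` can encode the whole frame in one call. A minimal sketch on a toy frame (hypothetical data, not the bank dataset):

```python
import pandas as pd

# Toy frame with one categorical and one numeric column
toy = pd.DataFrame({
    "marital": ["married", "single", "divorced", "single"],
    "balance": [1000, -50, 300, 0],
})

# One call expands the named columns into one indicator column per category;
# numeric columns pass through untouched
encoded = pd.get_dummies(toy, columns=["marital"])
print(encoded.columns.tolist())
```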
# Train the model
svm = LinearSVC(C=0.1, max_iter=10000)  # a larger max_iter helps convergence on ~45k rows
svm.fit(X_train, y_train)
# Evaluate the classification
predictions = svm.predict(X_test)
print(classification_report(y_test, predictions))
cm = confusion_matrix(y_test, predictions)
class_names = [0, 1]
# Create annotated heatmap
heatmap = ff.create_annotated_heatmap(z=cm,
                                      x=class_names,
                                      y=class_names,
                                      colorscale='Viridis')
# Update layout
heatmap.update_layout(title='Confusion Matrix',
                      xaxis=dict(title='Predicted service'),
                      yaxis=dict(title='True service'))
# Show the plot
heatmap.show()
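In this dataset the 'yes' class is much rarer than 'no', which can bias a linear SVM toward the majority class and hurt recall on the subscribers we actually care about. A minimal sketch (synthetic data, not the bank dataset) of the `class_weight='balanced'` option of `LinearSVC`, which reweights each class inversely to its frequency:

```python
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.datasets import make_classification

# Synthetic imbalanced binary problem (roughly a 90/10 split)
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)

# 'balanced' makes errors on the rare class cost more during training
svm_bal = LinearSVC(C=0.1, class_weight="balanced", max_iter=5000)
svm_bal.fit(X, y)
pred = svm_bal.predict(X)
print((pred == 1).sum())  # number of minority-class predictions
```

Whether this helps on the real frame should be checked against the classification report above; it trades some majority-class precision for minority-class recall.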
# Descriptive 6: Correlation matrix
# Create a correlation matrix
df = df.drop('id', axis=1)
correlation_matrix = df.corr().round(2)
# Extract column names
columns = correlation_matrix.columns.tolist()
# Create a Plotly heatmap
heatmap = ff.create_annotated_heatmap(z=correlation_matrix.values,
                                      x=columns,
                                      y=columns,
                                      colorscale='Viridis')
# Update layout
heatmap.update_layout(
    title='Correlation Heatmap',
    xaxis=dict(title='Features'),
    yaxis=dict(title='Features'),
)
# Show the plot
heatmap.show()
The most relevant variables to predict the service are duration, housing and contact.
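That ranking can be read straight off the correlation matrix by sorting features on their absolute correlation with the target. A minimal sketch (toy data and hypothetical column names, not the bank frame):

```python
import numpy as np
import pandas as pd

# Toy numeric frame with a binary target; 'duration' is built to drive y, 'age' is noise
rng = np.random.default_rng(0)
n = 200
toy = pd.DataFrame({"duration": rng.normal(size=n), "age": rng.normal(size=n)})
toy["y"] = (toy["duration"] + 0.1 * rng.normal(size=n) > 0).astype(int)

# Sort features by absolute correlation with the target, strongest first
ranking = toy.corr()["y"].drop("y").abs().sort_values(ascending=False)
print(ranking)
```

On the real frame the same one-liner applied to `df` gives the ordering quoted above, though correlation only captures linear relationships.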
To design future campaigns, the first thing the company must look at is a definition of what its typical client is. In this case, as the graphs above show, the model client is someone who has at most secondary education, is below 40 years old, has less than 20k in their bank account and is probably a technician or unemployed. This already gives a lot of information on how to target clients.
Moreover, the variables we have are strong predictors as a whole: we can build a model that predicts with 90% precision when clients are going to subscribe to the deposit. Predicting that a client will not subscribe is harder, with a lower precision of about 60%.
To conclude, the most relevant predictors are whether the client has a housing loan, the duration of the last contact call and the contact channel, so the company should focus on clients holding property and try to keep them on the phone as long as possible.